multinomial (Multinomial distribution)#
The multinomial distribution models counts across K categories when you run n independent categorical trials with fixed category probabilities.
This notebook uses the same parameterization as scipy.stats.multinomial:
n = total number of trials (a non-negative integer)
p = probability vector of length K with entries in \([0,1]\) that sum to 1
Learning goals#
By the end you should be able to:
recognize when a multinomial model is appropriate (and when it isn’t)
write down the PMF and a useful notion of a multivariate CDF
compute and interpret the mean vector, covariance, and key properties
derive the likelihood and the MLE for p
sample from a multinomial using NumPy-only algorithms
visualize the PMF/CDF for K=3 on the simplex (triangle)
use scipy.stats.multinomial for PMF/moments/sampling and implement missing pieces (CDF/fit) yourself
Prerequisites#
Basic probability (PMF/CDF), expectation, variance, covariance
Multinomial coefficients / factorials
Comfort with logs and derivatives
Table of contents#
Title & Classification
Intuition & Motivation
Formal Definition
Moments & Properties
Parameter Interpretation
Derivations
Sampling & Simulation
Visualization
SciPy Integration
Statistical Use Cases
Pitfalls
Summary
import math
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
np.set_printoptions(precision=6, suppress=True)
1) Title & Classification#
Name: multinomial (Multinomial distribution)
Type: Discrete (multivariate counts)
Support (for K categories and total count n):
\(\mathcal{S}_{n,K} = \left\{ x \in \mathbb{N}_0^K : \textstyle\sum_{i=1}^K x_i = n \right\}\)
Parameter space:
\(n \in \{0,1,2,\dots\}\)
\(p=(p_1,\dots,p_K)\) with \(p_i\ge 0\) and \(\sum_{i=1}^K p_i = 1\)
typically \(K\ge 2\)
We write:
\(X = (X_1,\dots,X_K) \sim \mathrm{Multinomial}(n, p),\)
where \(X_i\) counts how many times category \(i\) occurred.
2) Intuition & Motivation#
What this distribution models#
A clean mental model is rolling a K-sided die n times.
If the probability of face \(i\) is \(p_i\), and \(X_i\) counts how many times you saw face \(i\), then
\((X_1,\dots,X_K) \sim \mathrm{Multinomial}(n, p).\)
Key idea:
The multinomial is a multivariate generalization of the binomial: it counts how many times each category occurs.
Typical real-world use cases#
Survey / A-B-n experiments: how many users chose each option.
NLP (bag-of-words): word counts in a document given a fixed word-probability vector.
Quality control: counts of defect types.
Genetics: allele counts across categories.
Marketing / ads: impressions or clicks split across multiple creatives.
Relations to other distributions#
Categorical (a.k.a. multinoulli): if \(n=1\), then \(X\) is one-hot and you recover a categorical draw.
Binomial: if \(K=2\), then \(X_1\sim\mathrm{Bin}(n,p_1)\) and \(X_2=n-X_1\).
Binomial marginals: for any \(i\), \(X_i\sim\mathrm{Bin}(n,p_i)\) (but the components are dependent).
Poisson splitting: if \(Y_i\stackrel{\text{ind}}{\sim}\mathrm{Poisson}(\lambda p_i)\), then \((Y_1,\dots,Y_K)\mid \sum_i Y_i=n\) is multinomial \(\mathrm{Multinomial}(n,p)\).
Dirichlet–multinomial: if \(p\) is random with a Dirichlet prior and then \(X\mid p\) is multinomial, you get overdispersion.
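These relationships are easy to spot-check numerically. Below is a minimal sketch of the Poisson-splitting identity (illustrative only; it uses a throwaway generator rng_demo so the notebook's main rng stream is untouched): draw independent Poissons, keep the draws whose total equals n, and compare the conditional mean with the multinomial mean \(np\).
# Spot-check of Poisson splitting: condition independent Poissons on their sum.
rng_demo = np.random.default_rng(0)  # local generator; leaves the main rng untouched
lam = 10.0
p_check = np.array([0.2, 0.3, 0.5])
n_cond = 10
Y = rng_demo.poisson(lam * p_check, size=(200_000, 3))
kept = Y[Y.sum(axis=1) == n_cond]  # retain only draws whose total equals n_cond
{
    "num_kept": int(kept.shape[0]),
    "conditional_mean": kept.mean(axis=0),
    "multinomial_mean": n_cond * p_check,
}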
3) Formal Definition#
Let \(X=(X_1,\dots,X_K)\) be a count vector with \(\sum_i X_i = n\).
PMF#
For \(x\in\mathcal{S}_{n,K}\):
\(\Pr(X = x) = \frac{n!}{x_1!\,x_2!\cdots x_K!}\; p_1^{x_1} p_2^{x_2}\cdots p_K^{x_K}.\)
If \(x\notin\mathcal{S}_{n,K}\), the probability is 0.
A common shorthand is the multinomial coefficient:
\(\binom{n}{x_1,\dots,x_K} = \frac{n!}{x_1!\,x_2!\cdots x_K!}.\)
CDF (lower-orthant CDF)#
A standard multivariate analogue of a CDF is the lower-orthant probability:
\(F(x) = \Pr(X_1 \le x_1, \dots, X_K \le x_K) = \sum_{y \in \mathcal{S}_{n,K},\; y \le x} \Pr(X = y),\)
where \(y\le x\) means componentwise inequality.
Notes:
There is no simple closed form for this CDF in general.
Because \(\sum_i X_i=n\) almost surely, you need \(\sum_i x_i\ge n\) to get a nonzero CDF.
For visualization, it’s often more informative to leave some components unconstrained, e.g. \(\Pr(X_1\le a,\ X_2\le b)\) with the remaining count free.
def _validate_n(n):
if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
raise TypeError("n must be an integer")
n_int = int(n)
if n_int < 0:
raise ValueError("n must be >= 0")
return n_int
def _validate_p(p):
p = np.asarray(p, dtype=float)
if p.ndim != 1:
raise ValueError("p must be a 1D probability vector")
if p.size < 2:
raise ValueError("p must have length K>=2")
if not np.all(np.isfinite(p)):
raise ValueError("p must be finite")
if np.any(p < 0):
raise ValueError("p must be non-negative")
s = float(p.sum())
if not np.isclose(s, 1.0, atol=1e-12, rtol=0.0):
raise ValueError(f"p must sum to 1 (got {s})")
if s != 1.0:
p = p / s
return p
def _validate_counts(x, k, *, require_sum_n=None):
x = np.asarray(x)
if x.ndim == 1:
x = x[None, :]
if x.ndim != 2 or x.shape[1] != k:
raise ValueError(f"x must have shape (k,) or (m,k) with k={k}")
if not np.issubdtype(x.dtype, np.integer):
if np.any(np.abs(x - np.round(x)) > 0):
raise ValueError("x must contain integers")
x = np.round(x).astype(int)
else:
x = x.astype(int)
if np.any(x < 0):
raise ValueError("x must be nonnegative")
if require_sum_n is not None:
row_sums = x.sum(axis=1)
if np.any(row_sums != require_sum_n):
raise ValueError("Each row of x must sum to n")
return x
_lgamma_vec = np.vectorize(math.lgamma)
def multinomial_logpmf(x, n, p):
n = _validate_n(n)
p = _validate_p(p)
x = _validate_counts(x, k=p.size, require_sum_n=n)
if n == 0:
out = np.zeros(x.shape[0], dtype=float)
return out[0] if out.size == 1 else out
log_coeff = math.lgamma(n + 1.0) - np.sum(_lgamma_vec(x + 1.0), axis=1)
log_p = np.where(p > 0, np.log(p), -np.inf)
# Interpret 0 * log(0) = 0 (limit).
x_log_p = np.where(x == 0, 0.0, x * log_p[None, :])
log_prob = np.sum(x_log_p, axis=1)
out = log_coeff + log_prob
return out[0] if out.size == 1 else out
def multinomial_pmf(x, n, p):
return np.exp(multinomial_logpmf(x, n=n, p=p))
def compositions(n, k):
# Generate all k-tuples of nonnegative integers summing to n (stars and bars).
if k == 1:
yield (n,)
return
for i in range(n + 1):
for tail in compositions(n - i, k - 1):
yield (i,) + tail
def enumerate_support(n, k):
n = _validate_n(n)
if k < 1:
raise ValueError("k must be >= 1")
return np.array(list(compositions(n, k)), dtype=int)
def multinomial_cdf_small_n(x, n, p):
# Lower-orthant CDF by brute-force summation (only feasible for small n,k).
n = _validate_n(n)
p = _validate_p(p)
x = _validate_counts(x, k=p.size)[0]
ys = enumerate_support(n=n, k=p.size)
mask = np.all(ys <= x[None, :], axis=1)
return float(np.sum(multinomial_pmf(ys[mask], n=n, p=p)))
def simplex_xy_3(counts):
# Map 3-category compositions to 2D barycentric coordinates for plotting.
counts = np.asarray(counts, dtype=float)
counts = np.atleast_2d(counts)
if counts.shape[1] != 3:
raise ValueError("simplex_xy_3 expects shape (m,3)")
n = counts.sum(axis=1)
if np.any(n <= 0):
raise ValueError("All rows must sum to a positive n")
p = counts / n[:, None]
x = p[:, 1] + 0.5 * p[:, 2]
y = (np.sqrt(3) / 2.0) * p[:, 2]
return x, y
# Quick sanity check: PMF sums to 1 on the support
n = 6
p = np.array([0.2, 0.3, 0.5])
support = enumerate_support(n=n, k=p.size)
pmf = multinomial_pmf(support, n=n, p=p)
{
"support_size": int(support.shape[0]),
"pmf_sum": float(pmf.sum()),
"min_pmf": float(pmf.min()),
"max_pmf": float(pmf.max()),
}
{'support_size': 28,
'pmf_sum': 1.000000000000001,
'min_pmf': 6.400000000000006e-05,
'max_pmf': 0.13500000000000018}
4) Moments & Properties#
Let \(X\sim\mathrm{Multinomial}(n,p)\) with \(p\in\mathbb{R}^K\) and \(\sum_i p_i=1\).
Mean and covariance#
For each component:
\(\mathbb{E}[X_i] = n p_i, \quad \text{i.e. } \mathbb{E}[X] = np.\)
The variance and covariance are:
\(\mathrm{Var}(X_i) = n p_i (1 - p_i), \qquad \mathrm{Cov}(X_i, X_j) = -n p_i p_j \quad (i \ne j).\)
Equivalently, the covariance matrix is
\(\Sigma = n\left(\mathrm{diag}(p) - p p^\top\right).\)
A key qualitative property is negative dependence: if one category count is high, others must (on average) be lower because the total is fixed.
Skewness and kurtosis (marginals)#
Each component \(X_i\) is marginally binomial: \(X_i\sim\mathrm{Bin}(n,p_i)\). Therefore, for each \(i\):
\(\mathrm{Skew}(X_i) = \frac{1 - 2p_i}{\sqrt{n p_i (1-p_i)}}, \qquad \mathrm{Kurt}(X_i) - 3 = \frac{1 - 6 p_i (1-p_i)}{n p_i (1-p_i)}.\)
(These are univariate skewness/kurtosis of the marginals; multivariate notions also exist.)
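A quick numerical sanity check of the skewness formula; a sketch using NumPy’s built-in sampler and a local generator rng_demo:
# Monte Carlo marginal skewness vs the Bin(n, p_i) formula.
rng_demo = np.random.default_rng(1)
n_s = 40
p_s = np.array([0.15, 0.35, 0.20, 0.30])
Xs = rng_demo.multinomial(n_s, p_s, size=200_000)
Xc = Xs - Xs.mean(axis=0)
skew_mc = (Xc**3).mean(axis=0) / Xs.std(axis=0) ** 3
skew_theory = (1 - 2 * p_s) / np.sqrt(n_s * p_s * (1 - p_s))
{"skew_mc": skew_mc, "skew_theory": skew_theory}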
MGF and characteristic function#
For a vector \(t\in\mathbb{R}^K\), the multivariate moment generating function is
\(M_X(t) = \mathbb{E}\!\left[e^{t^\top X}\right] = \left(\sum_{i=1}^K p_i e^{t_i}\right)^{\!n}.\)
The characteristic function (\(\omega\in\mathbb{R}^K\)) is
\(\varphi_X(\omega) = \left(\sum_{i=1}^K p_i e^{\mathrm{i}\,\omega_i}\right)^{\!n}.\)
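The closed form is easy to verify by simulation; a minimal sketch (small t so the exponentials stay well-scaled, local generator rng_demo):
# Monte Carlo estimate of the MGF at a fixed t vs the closed form (sum_i p_i e^{t_i})^n.
rng_demo = np.random.default_rng(2)
t = np.array([0.05, -0.02, 0.01])
n_mgf = 10
p_mgf = np.array([0.2, 0.3, 0.5])
X_mgf = rng_demo.multinomial(n_mgf, p_mgf, size=200_000)
{
    "mgf_mc": float(np.exp(X_mgf @ t).mean()),
    "mgf_closed_form": float(np.sum(p_mgf * np.exp(t)) ** n_mgf),
}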
Entropy#
The entropy is
\(H(X) = -\sum_{x\in\mathcal{S}_{n,K}} \Pr(X=x)\,\log \Pr(X=x),\)
which generally has no simple closed form.
For large \(n\) (and all \(p_i>0\)), a useful approximation comes from a multivariate normal approximation on the \((K-1)\)-dimensional simplex:
\(H(X) \approx \frac{1}{2}\log\!\left((2\pi e\, n)^{K-1} \prod_{i=1}^K p_i\right).\)
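Since the support is small enough to enumerate at these sizes, we can compare the exact entropy against this approximation; a sketch reusing enumerate_support and multinomial_pmf from above:
# Exact entropy by enumeration vs the Gaussian approximation.
n_e = 30
p_e = np.array([0.2, 0.3, 0.5])
sup = enumerate_support(n=n_e, k=p_e.size)
pmf_e = multinomial_pmf(sup, n=n_e, p=p_e)
H_exact = float(-np.sum(pmf_e * np.log(pmf_e)))  # all pmf_e > 0 because every p_i > 0
H_approx = float(0.5 * np.log((2 * np.pi * np.e * n_e) ** (p_e.size - 1) * np.prod(p_e)))
{"H_exact": H_exact, "H_approx": H_approx}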
Other useful properties#
Sum constraint: \(\sum_i X_i = n\) almost surely, so the covariance matrix has rank \(K-1\).
Merging categories: if you merge two categories, you get another multinomial with merged probability.
Conditionals: given some counts, the remaining counts are multinomial (or binomial in the sequential decomposition).
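The merging property is likewise easy to check by simulation; a minimal sketch comparing the moments of merged counts against the merged-probability multinomial (local generator rng_demo):
# Merging check: if X ~ Multinomial(n, (p1, p2, p3)), then (X1, X2 + X3)
# should match Multinomial(n, (p1, p2 + p3)) in mean and variance.
rng_demo = np.random.default_rng(3)
n_m = 30
p_m = np.array([0.2, 0.3, 0.5])
X_m = rng_demo.multinomial(n_m, p_m, size=100_000)
merged = np.column_stack([X_m[:, 0], X_m[:, 1] + X_m[:, 2]])
p_merged = np.array([p_m[0], p_m[1] + p_m[2]])
{
    "mean_mc": merged.mean(axis=0),
    "mean_theory": n_m * p_merged,
    "var_mc": merged.var(axis=0),
    "var_theory": n_m * p_merged * (1 - p_merged),
}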
# Verify mean/covariance with Monte Carlo (using NumPy's built-in multinomial sampler)
def mean_cov_multinomial(n, p):
n = _validate_n(n)
p = _validate_p(p)
mean = n * p
cov = n * (np.diag(p) - np.outer(p, p))
return mean, cov
n = 40
p = np.array([0.15, 0.35, 0.20, 0.30])
mean_theory, cov_theory = mean_cov_multinomial(n, p)
samples = rng.multinomial(n, p, size=200_000)
mean_mc = samples.mean(axis=0)
cov_mc = np.cov(samples.T, ddof=0)
{
"mean_theory": mean_theory,
"mean_mc": mean_mc,
"max_abs_mean_err": float(np.max(np.abs(mean_mc - mean_theory))),
"max_abs_cov_err": float(np.max(np.abs(cov_mc - cov_theory))),
}
{'mean_theory': array([ 6., 14., 8., 12.]),
'mean_mc': array([ 6.004515, 13.99794 , 7.99156 , 12.005985]),
'max_abs_mean_err': 0.008440000000000225,
'max_abs_cov_err': 0.03195538522499941}
5) Parameter Interpretation#
n (total count)#
Controls the scale of the counts.
As n increases with fixed p, the distribution concentrates around its mean \(np\) and (after centering/scaling) becomes close to multivariate normal.
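A small illustration of this concentration: the sample proportions X/n tighten around p roughly like \(1/\sqrt{n}\) (sketch with a local generator rng_demo):
# Standard deviation of the sample proportions X/n shrinks as n grows.
rng_demo = np.random.default_rng(4)
p_demo = np.array([0.2, 0.3, 0.5])
for n_demo in (10, 100, 1000):
    props = rng_demo.multinomial(n_demo, p_demo, size=50_000) / n_demo
    print(n_demo, props.std(axis=0))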
p (category probabilities)#
Controls the direction of the mean and where the mass sits on the simplex.
If one \(p_i\) is large, most mass is near the corresponding vertex (most trials land in that category).
If p is close to uniform, mass is concentrated near the center (counts are balanced).
For K=3: points \(x=(x_1,x_2,x_3)\) lie on a triangle (the 2-simplex) because \(x_1+x_2+x_3=n\).
# How p changes the shape (K=3): PMF on the simplex
def plot_simplex_pmf(n, p, *, title):
p = _validate_p(p)
if p.size != 3:
raise ValueError("This visualization expects K=3")
support = enumerate_support(n=n, k=3)
pmf = multinomial_pmf(support, n=n, p=p)
sx, sy = simplex_xy_3(support)
tri_x = [0.0, 1.0, 0.5, 0.0]
tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
fig.add_trace(
go.Scatter(
x=sx,
y=sy,
mode="markers",
marker=dict(
size=10,
color=pmf,
colorscale="Viridis",
colorbar=dict(title="PMF"),
line=dict(width=0.2, color="rgba(0,0,0,0.2)"),
),
text=[f"x={tuple(row)}, pmf={val:.3e}" for row, val in zip(support, pmf)],
hoverinfo="text",
showlegend=False,
)
)
fig.update_layout(
title=title,
xaxis_title="barycentric x",
yaxis_title="barycentric y",
xaxis=dict(range=[-0.05, 1.05], zeroline=False),
yaxis=dict(
scaleanchor="x",
scaleratio=1,
range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05],
zeroline=False,
),
)
fig.add_annotation(x=0.0, y=0.0, text="cat 1", showarrow=False, yshift=-12)
fig.add_annotation(x=1.0, y=0.0, text="cat 2", showarrow=False, yshift=-12)
fig.add_annotation(x=0.5, y=float(np.sqrt(3) / 2.0), text="cat 3", showarrow=False, yshift=12)
fig.show()
n = 18
plot_simplex_pmf(n, p=[1 / 3, 1 / 3, 1 / 3], title="PMF on simplex (n=18, p uniform)")
plot_simplex_pmf(n, p=[0.70, 0.20, 0.10], title="PMF on simplex (n=18, p favors cat 1)")
plot_simplex_pmf(n, p=[0.10, 0.15, 0.75], title="PMF on simplex (n=18, p favors cat 3)")
6) Derivations#
A convenient derivation uses one-hot indicator variables.
Let \(Z_t\in\{e_1,\dots,e_K\}\) be the one-hot vector for trial \(t\), with
\(\Pr(Z_t = e_i) = p_i,\)
and define counts as
\(X = \sum_{t=1}^{n} Z_t.\)
Expectation#
By linearity of expectation:
\(\mathbb{E}[X] = \sum_{t=1}^{n} \mathbb{E}[Z_t] = n p.\)
Variance and covariance#
For a fixed trial \(t\), the indicators satisfy:
\(\mathbb{E}[Z_{t,i}] = \mathbb{E}[Z_{t,i}^2] = p_i, \qquad \mathbb{E}[Z_{t,i} Z_{t,j}] = 0 \quad (i \ne j),\)
because \(Z_{t,i}Z_{t,j}=0\) when \(i\ne j\).
Across trials \(t\ne s\), independence gives zero covariance. Therefore:
\(\mathrm{Var}(X_i) = n\,(p_i - p_i^2) = n p_i (1 - p_i),\)
and for \(i\ne j\):
\(\mathrm{Cov}(X_i, X_j) = n\,(0 - p_i p_j) = -n p_i p_j.\)
Likelihood and MLE#
For one observed count vector \(x\in\mathcal{S}_{n,K}\), the likelihood (as a function of \(p\)) is
\(L(p) = \frac{n!}{\prod_i x_i!} \prod_{i=1}^K p_i^{x_i}.\)
The log-likelihood (dropping constants independent of \(p\)) is
\(\ell(p) = \sum_{i=1}^K x_i \log p_i, \qquad \text{subject to } \sum_i p_i = 1.\)
Using a Lagrange multiplier for the constraint gives the MLE:
\(\hat p_i = \frac{x_i}{n}.\)
For multiple independent observations \(x^{(1)},\dots,x^{(m)}\) (possibly with different totals \(n_j\)), the MLE becomes
\(\hat p_i = \frac{\sum_{j=1}^m x_i^{(j)}}{\sum_{j=1}^m n_j}.\)
# MLE for p from multinomial samples
def fit_multinomial_mle(counts):
counts = np.asarray(counts)
if counts.ndim == 1:
counts = counts[None, :]
if counts.ndim != 2:
raise ValueError("counts must have shape (k,) or (m,k)")
if np.any(counts < 0):
raise ValueError("counts must be nonnegative")
totals = counts.sum(axis=1)
if np.any(totals == 0):
raise ValueError("each observation must have positive total count")
total_counts = counts.sum(axis=0)
return total_counts / total_counts.sum()
n = 25
p_true = np.array([0.10, 0.25, 0.05, 0.20, 0.40])
data = rng.multinomial(n, p_true, size=5_000)
p_hat = fit_multinomial_mle(data)
{
"p_true": p_true,
"p_hat": p_hat,
"L1_error": float(np.sum(np.abs(p_hat - p_true))),
}
{'p_true': array([0.1 , 0.25, 0.05, 0.2 , 0.4 ]),
'p_hat': array([0.100496, 0.250112, 0.049712, 0.199728, 0.399952]),
'L1_error': 0.0012160000000000712}
7) Sampling & Simulation#
Below are two NumPy-only strategies.
A) Count categorical draws#
Draw \(n\) iid categorical outcomes with probabilities \(p\).
Count how many times each category appears.
This is the definition of the model, but it costs \(O(n)\) work per sample.
B) Sequential binomials (conditional decomposition)#
A very useful identity is that a multinomial can be generated as a sequence of conditional binomials:
\(X_1 \sim \mathrm{Bin}(n, p_1), \qquad X_2 \mid X_1 \sim \mathrm{Bin}\!\left(n - X_1,\ \tfrac{p_2}{1 - p_1}\right),\)
…and so on, with the last component determined by the remaining count.
This is often faster than simulating all \(n\) categorical trials when n is large.
def multinomial_rvs_categorical_counts(n, p, size=1, *, rng: np.random.Generator):
n = _validate_n(n)
p = _validate_p(p)
k = p.size
if n == 0:
out = np.zeros((size, k), dtype=int)
return out[0] if size == 1 else out
draws = rng.choice(k, size=(size, n), p=p)
out = np.empty((size, k), dtype=int)
for i in range(size):
out[i] = np.bincount(draws[i], minlength=k)
return out[0] if size == 1 else out
def multinomial_rvs_sequential_binomial(n, p, size=1, *, rng: np.random.Generator):
n = _validate_n(n)
p = _validate_p(p)
k = p.size
if n == 0:
out = np.zeros((size, k), dtype=int)
return out[0] if size == 1 else out
out = np.zeros((size, k), dtype=int)
remaining_n = np.full(size, n, dtype=int)
remaining_prob = 1.0
for i in range(k - 1):
if p[i] == 0.0:
xi = np.zeros(size, dtype=int)
elif remaining_prob == p[i]:
xi = remaining_n.copy()
else:
pi_cond = p[i] / remaining_prob
xi = rng.binomial(remaining_n, pi_cond)
out[:, i] = xi
remaining_n = remaining_n - xi
remaining_prob = remaining_prob - p[i]
out[:, -1] = remaining_n
return out[0] if size == 1 else out
# Compare the two NumPy-only samplers
n = 30
p = np.array([0.2, 0.1, 0.3, 0.4])
size = 100_000
s1 = multinomial_rvs_categorical_counts(n, p, size=size, rng=rng)
s2 = multinomial_rvs_sequential_binomial(n, p, size=size, rng=rng)
mean_theory, cov_theory = mean_cov_multinomial(n, p)
summary = {
"max_abs_mean_err_categorical": float(np.max(np.abs(s1.mean(axis=0) - mean_theory))),
"max_abs_mean_err_sequential": float(np.max(np.abs(s2.mean(axis=0) - mean_theory))),
"max_abs_cov_err_categorical": float(np.max(np.abs(np.cov(s1.T, ddof=0) - cov_theory))),
"max_abs_cov_err_sequential": float(np.max(np.abs(np.cov(s2.T, ddof=0) - cov_theory))),
}
summary
{'max_abs_mean_err_categorical': 0.019259999999999167,
'max_abs_mean_err_sequential': 0.012549999999999173,
'max_abs_cov_err_categorical': 0.06312542559999734,
'max_abs_cov_err_sequential': 0.05956750249998599}
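To make the cost difference concrete, here is a rough timing sketch (illustrative only; results depend on hardware). The sequential sampler does O(K) binomial draws per sample instead of O(n) categorical draws.
# Rough timing comparison for large n (illustrative only).
import timeit
rng_demo = np.random.default_rng(5)
big_n = 100_000
t_cat = timeit.timeit(lambda: multinomial_rvs_categorical_counts(big_n, p, size=10, rng=rng_demo), number=3)
t_seq = timeit.timeit(lambda: multinomial_rvs_sequential_binomial(big_n, p, size=10, rng=rng_demo), number=3)
{"categorical_seconds": t_cat, "sequential_seconds": t_seq}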
8) Visualization#
We’ll visualize:
the PMF for K=3 on the simplex (triangle)
a useful 2D CDF slice: \(\Pr(X_1\le a, X_2\le b)\) (leaving the third count unconstrained)
Monte Carlo samples on the simplex
# PMF on the simplex (K=3)
n = 20
p = np.array([0.25, 0.50, 0.25])
support = enumerate_support(n=n, k=3)
pmf = multinomial_pmf(support, n=n, p=p)
sx, sy = simplex_xy_3(support)
tri_x = [0.0, 1.0, 0.5, 0.0]
tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
fig.add_trace(
go.Scatter(
x=sx,
y=sy,
mode="markers",
marker=dict(size=10, color=pmf, colorscale="Viridis", colorbar=dict(title="PMF")),
text=[f"x={tuple(row)}, pmf={val:.3e}" for row, val in zip(support, pmf)],
hoverinfo="text",
showlegend=False,
)
)
fig.update_layout(
title=f"Multinomial PMF on the simplex (n={n}, p={p.tolist()})",
xaxis_title="barycentric x",
yaxis_title="barycentric y",
xaxis=dict(range=[-0.05, 1.05], zeroline=False),
yaxis=dict(scaleanchor="x", scaleratio=1, range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05], zeroline=False),
)
fig.show()
# 2D CDF slice: P(X1 <= a, X2 <= b) for K=3
# This equals the lower-orthant CDF evaluated at (a, b, n).
n = 20
p = np.array([0.25, 0.50, 0.25])
support = enumerate_support(n=n, k=3)
pmf = multinomial_pmf(support, n=n, p=p)
cdf = np.zeros((n + 1, n + 1), dtype=float)
for a in range(n + 1):
for b in range(n + 1):
mask = (support[:, 0] <= a) & (support[:, 1] <= b)
cdf[a, b] = pmf[mask].sum()
fig = go.Figure(
data=go.Heatmap(
z=cdf,
x=np.arange(n + 1),
y=np.arange(n + 1),
colorscale="Blues",
colorbar=dict(title="P(X1≤a, X2≤b)"),
)
)
fig.update_layout(
title=f"2D CDF slice: P(X1≤a, X2≤b) (n={n}, p={p.tolist()})",
xaxis_title="b (upper bound for X2)",
yaxis_title="a (upper bound for X1)",
)
fig.show()
# Monte Carlo samples on the simplex (K=3)
n = 20
p = np.array([0.25, 0.50, 0.25])
samples = multinomial_rvs_sequential_binomial(n, p, size=25_000, rng=rng)
x, y = simplex_xy_3(samples)
tri_x = [0.0, 1.0, 0.5, 0.0]
tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
fig.add_trace(
go.Scatter(
x=x,
y=y,
mode="markers",
marker=dict(size=4, opacity=0.15, color="rgba(31,119,180,1)"),
showlegend=False,
)
)
# Theoretical mean location
mean = n * p
mx, my = simplex_xy_3(mean)
fig.add_trace(
go.Scatter(
x=mx,
y=my,
mode="markers",
marker=dict(size=12, color="red"),
name="mean (theory)",
)
)
fig.update_layout(
title=f"Monte Carlo samples on simplex (n={n}, p={p.tolist()})",
xaxis_title="barycentric x",
yaxis_title="barycentric y",
xaxis=dict(range=[-0.05, 1.05], zeroline=False),
yaxis=dict(scaleanchor="x", scaleratio=1, range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05], zeroline=False),
)
fig.show()
9) SciPy Integration#
SciPy provides a numerically robust implementation via scipy.stats.multinomial.
Supports pmf/logpmf, rvs, mean, cov, and entropy.
As of SciPy 1.15, multinomial does not expose:
a cdf method (multivariate CDFs are expensive)
a .fit() method
For CDFs and fitting, you typically implement problem-specific utilities:
CDFs by enumeration for small n, K, or by approximation.
p estimation via the closed-form MLE \(\hat p_i = \tfrac{\sum_j x_i^{(j)}}{\sum_j n_j}\).
from scipy.stats import multinomial
n = 12
p = np.array([0.2, 0.3, 0.5])
support = enumerate_support(n=n, k=p.size)
pmf_scipy = multinomial.pmf(support, n=n, p=p)
pmf_ours = multinomial_pmf(support, n=n, p=p)
samples_scipy = multinomial.rvs(n=n, p=p, size=50_000, random_state=rng)
# SciPy doesn't have multinomial.cdf, but we can compute it for small n with enumeration.
example_cdf = multinomial_cdf_small_n([4, 4, 12], n=n, p=p)
{
"pmf_sum_scipy": float(pmf_scipy.sum()),
"pmf_sum_ours": float(pmf_ours.sum()),
"max_abs_pmf_diff": float(np.max(np.abs(pmf_scipy - pmf_ours))),
"mean_empirical": samples_scipy.mean(axis=0),
"mean_theory": multinomial.mean(n=n, p=p),
"entropy_scipy": float(multinomial.entropy(n=n, p=p)),
"example_cdf_small_n": float(example_cdf),
}
{'pmf_sum_scipy': 0.9999999999999992,
'pmf_sum_ours': 0.9999999999999982,
'max_abs_pmf_diff': 2.220446049250313e-16,
'mean_empirical': array([2.3973 , 3.60692, 5.99578]),
'mean_theory': array([2.4, 3.6, 6. ]),
'entropy_scipy': 3.5338043228652705,
'example_cdf_small_n': 0.6555341562499992}
# "Fit" p: closed-form MLE on SciPy-generated data
n = 30
p_true = np.array([0.10, 0.25, 0.05, 0.20, 0.40])
data = multinomial.rvs(n=n, p=p_true, size=5_000, random_state=rng)
p_hat = fit_multinomial_mle(data)
{
"p_true": p_true,
"p_hat": p_hat,
"L1_error": float(np.sum(np.abs(p_hat - p_true))),
}
{'p_true': array([0.1 , 0.25, 0.05, 0.2 , 0.4 ]),
'p_hat': array([0.100627, 0.250553, 0.04976 , 0.200673, 0.398387]),
'L1_error': 0.0037066666666667053}
10) Statistical Use Cases#
A) Hypothesis testing (goodness-of-fit)#
Given observed counts \(x\) and a null probability vector \(p^{(0)}\), a common test is a chi-square goodness-of-fit test.
Null: data comes from \(\mathrm{Multinomial}(n, p^{(0)})\)
Expected counts under the null: \(n p^{(0)}\)
Test statistic: \(\chi^2 = \sum_{i=1}^K \frac{\left(x_i - n p_i^{(0)}\right)^2}{n p_i^{(0)}}\), which is approximately \(\chi^2_{K-1}\) under the null.
Rule of thumb: chi-square approximations are best when expected counts aren’t too small (a common guideline is all expected counts at least 5).
B) Bayesian modeling (Dirichlet conjugacy)#
If
\(p \sim \mathrm{Dirichlet}(\alpha) \quad \text{and} \quad X \mid p \sim \mathrm{Multinomial}(n, p),\)
then the posterior is
\(p \mid X = x \;\sim\; \mathrm{Dirichlet}(\alpha + x).\)
C) Generative modeling#
Multinomial likelihoods appear whenever you model count vectors given probabilities, e.g.
bag-of-words document models
discrete emissions in mixture models
naive Bayes classifiers
from scipy.stats import chisquare, power_divergence
# A) Chi-square goodness-of-fit for one multinomial observation
n = 200
p0 = np.array([0.2, 0.5, 0.3])
# simulate an observation under an alternative
x_obs = multinomial.rvs(n=n, p=[0.25, 0.45, 0.30], size=1, random_state=rng)[0]  # [0]: flatten the (1, K) sample to a 1D count vector
expected = n * p0
chi2 = chisquare(f_obs=x_obs, f_exp=expected)
# Likelihood ratio (G-test): lambda_=0 gives the log-likelihood ratio statistic
# (Different approximations behave differently when expected counts are small.)
gtest = power_divergence(f_obs=x_obs, f_exp=expected, lambda_=0)
{
"x_obs": x_obs,
"p0": p0,
"chi2_stat": float(chi2.statistic),
"chi2_pvalue": float(chi2.pvalue),
"g_stat": float(gtest.statistic),
"g_pvalue": float(gtest.pvalue),
}
# B) Dirichlet conjugacy: posterior update and posterior mean
def dirichlet_rvs_numpy(alpha, size, *, rng: np.random.Generator):
alpha = np.asarray(alpha, dtype=float)
if alpha.ndim != 1 or alpha.size < 2:
raise ValueError("alpha must be a 1D array with length >= 2")
if np.any(alpha <= 0) or not np.all(np.isfinite(alpha)):
raise ValueError("alpha must be positive and finite")
g = rng.gamma(shape=alpha, scale=1.0, size=(size, alpha.size))
return g / g.sum(axis=1, keepdims=True)
alpha_prior = np.array([1.0, 1.0, 1.0]) # uniform prior on 3-simplex
n = 50
p_true = np.array([0.2, 0.5, 0.3])
x = multinomial.rvs(n=n, p=p_true, size=1, random_state=rng)[0]  # 1D count vector
alpha_post = alpha_prior + x
posterior_mean = alpha_post / alpha_post.sum()
{
"x": x,
"alpha_prior": alpha_prior,
"alpha_post": alpha_post,
"posterior_mean": posterior_mean,
}
{'x': array([12, 25, 13]),
 'alpha_prior': array([1., 1., 1.]),
 'alpha_post': array([13., 26., 14.]),
 'posterior_mean': array([0.245283, 0.490566, 0.264151])}
# C) Generative modeling example: bag-of-words counts
vocab = ["cat", "dog", "fish", "tree", "car"]
p_words = np.array([0.30, 0.25, 0.05, 0.20, 0.20])
n_words = 80
counts = multinomial.rvs(n=n_words, p=p_words, size=1, random_state=rng)[0]  # 1D count vector
{w: int(c) for w, c in zip(vocab, counts)}
11) Pitfalls#
Invalid parameters#
n must be a non-negative integer.
p must be non-negative and sum to 1.
For the PMF, x must be a non-negative integer vector with sum exactly n.
Numerical issues#
Factorials explode quickly; compute PMFs in log-space when n is moderate/large.
Probabilities can underflow for rare events; prefer logpmf when comparing likelihoods.
Modeling issues#
The multinomial assumes independent trials with a fixed probability vector p. Overdispersion (extra variability across replicates) is common; a Dirichlet–multinomial can help.
Categories must be well-defined and mutually exclusive; if observations can belong to multiple labels, the model changes.
CDF gotchas#
Multivariate CDFs are not unique in the same way as 1D CDFs (different “orders”/definitions exist).
Even for the standard lower-orthant CDF, computation is usually expensive beyond small n, K.
12) Summary#
multinomial is a discrete multivariate distribution over count vectors \(x\in\mathbb{N}_0^K\) with \(\sum_i x_i=n\).
PMF: \(\frac{n!}{\prod_i x_i!}\prod_i p_i^{x_i}\); mean \(np\); covariance \(n(\mathrm{diag}(p)-pp^\top)\).
Each component is marginally binomial, but components are negatively correlated.
The MLE for p is the empirical frequency: \(\hat p_i = x_i/n\) (or pooled across observations).
For computation and simulation, prefer scipy.stats.multinomial and log-space methods; for CDFs you usually need custom code or approximations.
References#
SciPy: scipy.stats.multinomial
Standard probability texts covering multinomial coefficients and multinomial sampling